Abstract: Image representation and classification are two
fundamental tasks in multimedia content retrieval and
understanding. The idea that shape and texture information
(e.g., edges or orientations) are the key features for visual
representation is ingrained and dominant in the current multimedia
and computer vision communities. A number of low-level features
have been proposed that compute local gradients (e.g., SIFT, LBP
and HOG), and they have achieved great success in numerous
multimedia applications. In this paper, we present a simple yet
efficient local descriptor for image classification, referred to as
the Local Color Contrastive Descriptor (LCCD), which leverages the
neural mechanisms of color contrast. The idea originates from the
observation in neuroscience that color and shape information are
inextricably linked in visual cortical processing. Color contrast
yields key information for visual color perception and provides a
strong link between color and shape. We propose a novel contrastive
mechanism that computes color contrast both across spatial
locations and across multiple channels. The color contrast is
computed by measuring the f-divergence between the color
distributions of two regions. Our descriptor enriches local image
representation with both color and contrast information. We verify
experimentally that it strongly complements shape-based descriptors
(e.g., SIFT) while remaining computationally simple. Extensive
experimental results on image classification show that our
descriptor substantially improves the performance of SIFT when the
two are combined, and achieves state-of-the-art performance on
three challenging benchmark datasets. It improves a recent deep
learning model (DeCAF) from an accuracy of 40.94% to 49.68% on the
large-scale SUN397 database. Code for the LCCD will be made
available.

Index Terms: Shape information, color contrastive descriptor,
f-divergence, multiple color channels.


I. Introduction 

Image representation has long been an active yet challenging
topic in the multimedia community. It is a fundamental task for
image content understanding, and plays a crucial role in the
success of numerous image/video applications, such as image
categorization, object detection/recognition, action recognition,
and image segmentation. Over the last two decades, a large amount
of research effort has been devoted to designing efficient local
descriptors for image/video representation. The state-of-the-art
descriptors are mostly based on shape descriptions, such as edges,
corners or gradient orientations, and discard color information.
Typical examples include the Scale Invariant Feature Transform
(SIFT) [14], Local Binary Pattern descriptor (LBP) [15], [16],
Histogram of Oriented Gradients (HOG) [17], Region Covariance
Descriptor (RCD) [18], [19], and PixNet visual features [20].
They have been applied to numerous multimedia and image/vision
applications with great success. Most of these descriptors are
designed for gray-scale images. Evidently, the view that local
shape or gradient information is the key feature for image
representation is ingrained and dominant. This makes sense insofar
as people can easily understand the contents or actions in a
black-and-white movie, such as a Charlie Chaplin film, without
knowing its true colors.

However, recent observations from visual neuroscience indicate
that shape/form information is not the only visual property of
objects and surfaces; rather, color and shape are inextricably
linked as properties of the scene in visual perception and visual
cortical processing [23], [24], [25]. To account for color
information in image representation, Mindru et al. [26] proposed a
combined color and shape feature for local patches based on color
moments, which is shown to be invariant to illumination changes.
In [27], the description of local features is extended with color
information in an effort to increase robustness against photometric
changes and varying image quality. To increase photometric
invariance and discriminative power, a number of color descriptors
based on both color histograms and SIFT, including color SIFT, are
systematically reviewed and evaluated in [28]. However, the use of
color information for image description has received relatively
little attention in multimedia research, mainly because the large
variations in real-world scenes significantly increase the
difficulty of measuring color information robustly.

The goal of this paper is to enrich local image representation
with color and contrast information by presenting an efficient
local descriptor. The key issue is how to robustly extract color
contrast information that interacts effectively with, and strongly
complements, the shape information. It has been observed in visual
neuroscience that the color perception of a region sometimes
depends more on the color contrast at the boundary of the region
than on the spectral reflectance of the region itself [23]. Color
contrast can have a major effect on color perception, and color
and shape interact through the spatial layout of the surroundings
relative to a target region [23].

Motivated by these biological findings, we develop a novel
local descriptor, named the Local Color Contrastive Descriptor
(LCCD). The LCCD uses the f-divergence to measure color contrast
effectively, which gives it a strong ability to describe local
image content robustly. The f-divergence yields a class of measures
between local distributions, and has been used successfully as a
feature for robust speech recognition [29]. The LCCD differs
distinctly from most current local color descriptors, which often
extract color features from each region independently and discard
important correlation information between neighboring regions. The
region-based contrastive information enhances the spatial locality
of the LCCD, and increases its robustness against the noise to
which single-pixel operations are prone. Finally, we demonstrate
the efficiency and effectiveness of the LCCD descriptor for image
classification. The main contributions of the paper are summarized
below.

Fig. 1. Top: the low-level filters (from the first convolutional layer) learned by a deep convolutional neural network. Bottom: shape and opponent
information in the human visual system [22]. The low-level filters of the deep CNN contain both edge/shape filters and color contrastive information, which are
highly consistent with the shape and opponent information in the human visual system. Both support our claim that color contrast plays a
crucial role in image description.

1) We propose a novel local descriptor, the LCCD, by
leveraging color contrastive features. We develop a novel
mechanism for measuring the color contrast of an image region that
detects local contrastive information both across spatial
locations and across multiple color channels. We find that the
proposed color contrast is highly consistent with the structure of
the low-level filters learned by a deep convolutional neural
network, as shown in Figure 1.

2) We introduce the f-divergence to robustly measure the
difference between the color distributions of local regions. It
has been shown that the f-divergence is invariant to invertible
transformations [30]. We leverage this appealing property to give
the LCCD strong robustness against multiple local image
distortions.

3) We propose a subspace extension for computing the
f-divergence between feature histograms. This improvement allows
our descriptor to capture more detailed information from the
images, which increases its discriminative power considerably.

4) We show experimentally that the color contrastive
information strongly complements the gradient-based SIFT,
substantially improving its performance when the two are combined.
The LCCD descriptor with SIFT achieves remarkable results on three
benchmarks: the MIT Indoor-67 database [31], SUN397 [32] and
PASCAL VOC 2007 [33], advancing the state-of-the-art results
considerably.

The rest of the paper is organized as follows. Section 2 briefly
reviews related studies. Details of the proposed LCCD are
presented in Section 3, including descriptions of the f-divergence,
the spatially-contrastive and channel-contrastive features, and the
subspace extension. Experimental results are presented in
Section 4, followed by the conclusions in Section 5.

II. Related Works 

Local image descriptors have been an active research topic in
the image and multimedia communities in recent years. Lowe [14]
proposed the powerful Scale Invariant Feature Transform (SIFT)
descriptor, which computes a 3D histogram of gradient location and
orientation. The spatial location is divided into a 4 x 4 grid and
the gradient angle is quantized into eight orientations, giving
4 x 4 x 8 = 128 dimensions. The SIFT descriptor is computed from
the appearance of the object at particular interest points, and is
invariant to scale and rotation. It is robust against changes in
illumination, noise, and viewpoint. Although SIFT is powerful for
image description by computing gradient features, it does not
exploit color information, leading to a less informative
representation.

Local Binary Pattern (LBP) [15] and its extensions have
achieved great success in texture description [34] and face
recognition [36]. LBP labels image pixels by thresholding the
neighborhood of each central pixel to generate a binary string for
feature representation. It has been widely applied to face and
texture recognition due to its high performance and computational
simplicity. However, LBP relies on pixel-level operations to
compute the binary features, which largely limits its robustness
against noise and multiple local image distortions.
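
The thresholding step behind LBP can be illustrated with a minimal Python sketch. The function name `lbp_code`, the 3 x 3 neighborhood, and the clockwise bit ordering are assumptions made for illustration; they are not taken from [15]. The second patch differs from the first by one gray level in a single pixel, which is enough to change the code and hints at the noise sensitivity noted above.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the centre pixel of a 3x3 patch.

    `patch` is a list of three rows of three gray values. Neighbours are
    visited clockwise from the top-left corner (an illustrative convention).
    """
    center = patch[1][1]
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (row, col) in enumerate(offsets):
        # A neighbour contributes a 1-bit when it is at least as bright as the centre.
        if patch[row][col] >= center:
            code |= 1 << bit
    return code


clean = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
# Same patch with the top-left pixel raised by one gray level.
noisy = [[6, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]

print(lbp_code(clean), lbp_code(noisy))
```

Because each bit depends on a single pixel comparison, a small perturbation of one pixel can flip a bit of the code, whereas a region-level histogram changes only slightly.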

Recent efforts focus on developing mid/high-level models that
encode multiple low-level features for image description, such as
the Bag-of-Features (BoF) [37], Object Bank (OB) [38], and
Bag-of-Parts (BoP) [39] models. The OB is a high-level image
representation in which an image is represented as a
scale-invariant response map of pre-trained generic object
detectors [38]. The BoP, building on the HOG feature,
automatically detects distinctive parts of scene images for
recognition [39]. Furthermore, a number of recent descriptors have
achieved impressive results in combination with SIFT and the SPM
model. For example, Orientational Pyramid Matching (OPM) [40] uses
the 3D orientations of objects to form the pyramid and produce the
pooling regions. It is strongly complementary to SIFT and SPM, so
that their combination achieves excellent performance in scene
recognition. However, like SIFT, these descriptors include neither
color nor local contrastive information. We will show that these
features provide important complementary information for image
representation.

Few works apply color information to the design of local image
descriptors. The Local Color Statistic (LCS) descriptor computes
the means and standard deviations of the 3 RGB channels in 16
sub-regions, which results in a 96-dimensional color feature. It
was combined with the SIFT descriptor for Fisher Vector encoding,
and achieved remarkable performance in scene image classification
[42]. In [28], color descriptors based on color histograms and
moments, as well as color SIFT, were proposed and discussed. It
has been demonstrated experimentally that combining multiple color
descriptors with SIFT leads to significant performance
improvements in image/video classification [28]. Our work is
related to these descriptors in that it leverages color
information for image description. In contrast to them, we compute
both color and local contrastive information from the image, and
thus encode richer local features. Our LCCD descriptor therefore
provides a more principled approach to measuring color
information, which sets it apart from previous color descriptors.
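
For comparison with the contrastive features introduced later, the LCS computation described above (mean and standard deviation of each RGB channel in 16 sub-regions) can be sketched as follows. This is only an illustration: the function name `lcs_descriptor`, the 4 x 4 arrangement of the 16 sub-regions, the use of the population standard deviation, and the ordering of the output are assumptions rather than details given in the text.

```python
import statistics


def lcs_descriptor(patch, grid=4):
    """Sketch of a Local Color Statistic feature.

    `patch` is a list of rows, each row a list of (r, g, b) tuples, whose
    height and width are divisible by `grid`. Returns grid*grid*3*2 values.
    """
    height, width = len(patch), len(patch[0])
    cell_h, cell_w = height // grid, width // grid
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            cell = [patch[y][x]
                    for y in range(gy * cell_h, (gy + 1) * cell_h)
                    for x in range(gx * cell_w, (gx + 1) * cell_w)]
            for channel in range(3):
                values = [pixel[channel] for pixel in cell]
                # Each region is summarised independently of its neighbours.
                feature.append(statistics.fmean(values))
                feature.append(statistics.pstdev(values))
    return feature


flat_patch = [[(10, 20, 30)] * 8 for _ in range(8)]
lcs = lcs_descriptor(flat_patch)
print(len(lcs))
```

Note that every value is computed from one region in isolation, which is the property the LCCD departs from by measuring contrast between regions and between channels.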

III. Local Color Contrastive Descriptor 

This section presents the details of the proposed Local Color
Contrastive Descriptor (LCCD). We first introduce the
f-divergence, which is used to robustly compute the contrast
between two local regions. The LCCD descriptor is then constructed
from two types of contrastive features: local spatially-contrastive
and channel-contrastive features, which extract contrastive
information from spatial locality and from multiple color
channels, respectively. Finally, a subspace extension is developed
to further enhance its discriminative ability.

A. F-divergence for Contrastive Measurement 

Computing contrastive information between the feature
distributions of two local regions serves as the basic component
of the LCCD descriptor. Traditionally, a classical $L_p$ distance
(e.g., $L_1$ or $L_2$) is used to compute the difference
(dissimilarity) between two feature vectors. However, the
f-divergence family has been shown to be more suitable for
measuring contrastive information, owing to its robustness to
transformations [30]. It has been widely applied in statistical
learning and information theory, and has recently achieved
excellent results in robust speech recognition [43].

In statistics and information theory, the Csiszár f-divergence
(also known as the Ali-Silvey distance) measures the difference
(dissimilarity) between two distributions. Formally, let
$f : (0, \infty) \to \mathbb{R}$ be a real convex function with
$f(1) = 0$, and let $p_i(x)$ and $p_j(x)$ be the density functions
of two distributions defined on a measurable space $\Re$. Then

$$D_f(p_i, p_j) = \int_x p_j(x)\, f\!\left(\frac{p_i(x)}{p_j(x)}\right) dx \qquad (1)$$

To ensure that the f-divergence between two identical
distributions is zero, $D_f(p, p) = 0$, we impose the constraint
$f(1) = 0$ [44]. It has been proved that many well-known distances
and divergences in statistics and information theory, such as the
KL divergence, Bhattacharyya distance and Hellinger distance, can
be regarded as special cases of the f-divergence [43], [46],
depending on the choice of the function $f$. Their definitions are
presented in Table I. The choice of $f$ is discussed below.

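TABLE I
Definitions of distances or divergences from the f-divergence family

Distance/divergence      Definition                                                                          f(t)
Bhattacharyya distance   $-\ln \int \sqrt{p_i p_j}\, dx$                                                      $\sqrt{t}$
KL-divergence            $\int p_i \ln \frac{p_i}{p_j}\, dx$                                                  $t \log(t)$
Symm. KL-divergence      $\int \left(p_i \ln \frac{p_i}{p_j} + p_j \ln \frac{p_j}{p_i}\right) dx$             $t \log(t) - \log(t)$
Hellinger distance       $\frac{1}{2} \int \left(\sqrt{p_i} - \sqrt{p_j}\right)^2 dx$                         $\frac{1}{2}\left(\sqrt{t} - 1\right)^2$
Total variation          $\int |p_i - p_j|\, dx$                                                              $|t - 1|$
Pearson divergence       $\int \frac{1}{p_j} (p_i - p_j)^2\, dx$                                              $(t - 1)^2$
Alpha divergence         $\frac{1}{\alpha(1-\alpha)} \left(1 - \int p_i^{\alpha} p_j^{1-\alpha}\, dx\right)$  $\frac{t}{1-\alpha} + \frac{1}{\alpha} - \frac{t^{\alpha}}{\alpha(1-\alpha)}$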
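
To make the relation between a generator function $f$ and the resulting distance concrete, the following Python sketch evaluates the discrete analogue of Eq. (1) on two small histograms. All function names and the example histograms are assumptions introduced for illustration; the sketch also assumes that every bin of the second histogram is strictly positive so that the ratio is defined.

```python
import math


def f_divergence(p, q, f):
    """Discrete analogue of Eq. (1): sum_k q_k * f(p_k / q_k), with all q_k > 0."""
    return sum(qk * f(pk / qk) for pk, qk in zip(p, q))


def hellinger_generator(t):
    return 0.5 * (math.sqrt(t) - 1.0) ** 2


def kl_generator(t):
    # t * log(t), extended by continuity to 0 at t = 0.
    return t * math.log(t) if t > 0 else 0.0


def total_variation_generator(t):
    return abs(t - 1.0)


def hellinger_direct(p, q):
    """Hellinger distance written out directly, as in the Definition column."""
    return 0.5 * sum((math.sqrt(pk) - math.sqrt(qk)) ** 2 for pk, qk in zip(p, q))


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

print(f_divergence(p, q, hellinger_generator), hellinger_direct(p, q))
print(f_divergence(p, q, kl_generator))
print(f_divergence(p, q, total_variation_generator))
```

The first line prints the same value twice up to rounding, since $q_k \cdot \frac{1}{2}(\sqrt{p_k / q_k} - 1)^2 = \frac{1}{2}(\sqrt{p_k} - \sqrt{q_k})^2$. Passing the same histogram for both arguments gives zero for every generator, which reflects the constraint $f(1) = 0$.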


It has been shown that the f-divergence has a number of
remarkable properties. One of its advantages is that the
f-divergence between two distributions is invariant to invertible
transformations [43]. Consider a feature space $\Re$ and two
distributions $p_i(x)$ and $p_j(x)$ in $\Re$ ($x \in \Re$). Let
$g : X \to Y$ denote an invertible (linear or nonlinear)
transformation that maps $x$ to a new feature $y$. In this way,
the distributions $p_i(x)$ and $p_j(x)$ are transformed into
$q_i(y)$ and $q_j(y)$. We seek an invariant measure $D$ such that
$D(p_i, p_j) = D(q_i, q_j)$, based on the following theorem [30].


Theorem. The f-divergence between two distributions is
invariant under an invertible transformation g on the space $\Re$.


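Proof: With the invertible transformation $g$, we have
$y = g(x)$, so that the distribution $q_i(y)$ is given by

$$q_i(y) = p_i(g^{-1}(y))\, G(y), \qquad (2)$$

where $g^{-1}$ denotes the inverse function of $g$, and $G(y)$ is
the absolute value of the determinant of the Jacobian matrix of
$g^{-1}(y)$. With $dx = G(y)\, dy$, we have

$$\begin{aligned}
D_f(p_i, p_j) &= \int p_j(x)\, f\!\left(\frac{p_i(x)}{p_j(x)}\right) dx \\
&= \int p_j(g^{-1}(y))\, G(y)\, f\!\left(\frac{p_i(g^{-1}(y))\, G(y)}{p_j(g^{-1}(y))\, G(y)}\right) dy \\
&= \int q_j(y)\, f\!\left(\frac{q_i(y)}{q_j(y)}\right) dy \\
&= D_f(q_i, q_j)
\end{aligned} \qquad (3)$$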
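
The invariance can also be checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it picks two Gaussian densities and the affine map $g(x) = 3x + 2$ (both chosen arbitrarily for this example), integrates with a simple midpoint rule, and compares the Hellinger distance with a squared $L_2$ distance before and after the transformation.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (k + 0.5) * width) for k in range(steps))


def hellinger(p, q, lo, hi):
    return integrate(lambda x: 0.5 * (math.sqrt(p(x)) - math.sqrt(q(x))) ** 2, lo, hi)


def squared_l2(p, q, lo, hi):
    return integrate(lambda x: (p(x) - q(x)) ** 2, lo, hi)


# Original densities p_i and p_j (hypothetical example values).
def p_i(x):
    return gaussian_pdf(x, 0.0, 1.0)


def p_j(x):
    return gaussian_pdf(x, 1.0, 1.5)


# Densities after y = g(x) = 3x + 2: means map through g, standard deviations scale by 3.
def q_i(y):
    return gaussian_pdf(y, 2.0, 3.0)


def q_j(y):
    return gaussian_pdf(y, 5.0, 4.5)


# g maps the integration interval [-15, 15] onto [-43, 47].
h_before = hellinger(p_i, p_j, -15.0, 15.0)
h_after = hellinger(q_i, q_j, -43.0, 47.0)
l2_before = squared_l2(p_i, p_j, -15.0, 15.0)
l2_after = squared_l2(q_i, q_j, -43.0, 47.0)

print(h_before, h_after)    # agree up to numerical error
print(l2_before, l2_after)  # differ by the Jacobian factor of 3
```

In this example the Hellinger distance is unchanged by the transformation, while the squared $L_2$ distance shrinks by the factor $G(y) = 1/3$, which illustrates why an $L_p$ distance does not share the invariance property.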


This is an appealing property for image description, whose
goal is to capture the meaningful underlying local structure of
the image while being robust against multiple local image
distortions. We therefore adopt the Hellinger distance, a member
of the f-divergence family, as the basic function for computing
the color contrast between local image regions.

Fig. 2. Pipeline of the proposed Local Color Contrastive Descriptor (LCCD), including (a) the spatially-contrastive feature (LCCD_S) and (b) the
channel-contrastive feature (LCCD_C). (c) The computation of the SIFT descriptor in a local image patch. The main differences between the LCCD and SIFT are
clearly shown. SIFT computes histograms of gradient information from each region independently, while the LCCD captures the
spatial correlation between neighboring regions (through the spatially-contrastive feature) and encodes color contrastive information between different
image channels (through the channel-contrastive feature). The LCCD descriptor thus provides strong complementary information to SIFT by leveraging both
the color and the contrastive information of the image.


B. Local Color Contrastive Descriptor 

Motivated by the observation in visual neuroscience that
contrastive information plays a crucial role in color perception,
we aim to enrich the image representation with the contrastive
aspect of color and shape information. To this end, we exploit the
neural mechanisms of color contrast to design a new and powerful
local image descriptor. The proposed Local Color Contrastive
Descriptor (LCCD) is computed from an image patch by dividing it
into a number of (sub-)regions (cells), as shown in the left
column of Figure 2. The region-based design of the LCCD increases
its robustness against image noise, which often degrades
descriptors built on isolated pixel operations, such as the
LBP-based methods. It also takes into account the spatial layout
of the image as well as the statistical properties computed from
each region. To extract contrastive features from both spatially
neighboring regions and different image channels, the LCCD
computes both spatially-contrastive and channel-contrastive
features based on the f-divergence.

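1) Spatially-Contrastive Feature: The spatially-contrastive
feature computes the relative difference of statistical color
features between neighboring regions. The image patch is first
transformed from the RGB space to the opponent color space as in
[28],

$$\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} =
\begin{pmatrix} (R - G)/\sqrt{2} \\ (R + G - 2B)/\sqrt{6} \\ (R + G + B)/\sqrt{3} \end{pmatrix} \qquad (4)$$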


where channels $O_1$ and $O_2$ carry the color information and
$O_3$ carries the intensity. We compute a spatially-contrastive
feature from each channel. Specifically, as shown in Figure 2(a),
an image patch (in one channel) is divided into 3 x 3 = 9 regions.
We compute a d-bin histogram from each region, which can be
regarded as a discrete probability distribution of the feature in
that region. The contrastive feature is computed by measuring the
f-divergence between the feature distributions of the central
region ($P$) and its 8 neighboring regions
($Q = Q^1, Q^2, \ldots, Q^8$), as shown in Figure 2(a),

$$LCCD_S = [h(P, Q^1), h(P, Q^2), \ldots, h(P, Q^8)], \qquad (5)$$

where $h(P, Q^i)$ is the Hellinger distance, which measures the
contrast between the feature histograms of $P$ and $Q^i$. It is
computed as

$$h(P, Q^i) = \frac{1}{2} \sum_{k=1}^{d} \left(\sqrt{p_k} - \sqrt{q_k^i}\right)^2, \quad i = 1, 2, \ldots, 8, \qquad (6)$$

where $k$ is the bin index of the d-bin histogram, and $p_k$ and
$q_k^i$ are the values of the $k$-th bins of the feature
histograms of the central region ($P$) and the $i$-th neighboring
region ($Q^i$), respectively. We thus compute an 8-dimensional
feature vector from an image patch (in a single channel). Each
dimension of the vector corresponds to the contrast with one
neighboring region. Finally, we calculate three such 8-dimensional
features from the $O_1$, $O_2$ and $O_3$ channels, and concatenate
them to construct the final spatially-contrastive feature, which
has 24 dimensions. This is significantly lower than the 128
dimensions used by SIFT. One could use a more complex member of
the f-divergence family, such as the Alpha divergence, but we
found experimentally that more complex distances do not yield
significantly better results. The Hellinger distance is simple,
yet effective enough to capture the meaningful local contrastive
structure of the images.
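
The steps of the spatially-contrastive feature can be put together in a short Python sketch. It is a simplified illustration under several assumptions that are not specified in the text: 8-bit RGB input, the standard opponent transform of [28] with fixed value ranges for binning, histograms normalized to sum to one, and row-major ordering of the eight neighbors. The function names are hypothetical, and the subspace extension is not included.

```python
import math


def rgb_to_opponent(r, g, b):
    """Opponent color transform: O1 and O2 carry color, O3 carries intensity."""
    o1 = (r - g) / math.sqrt(2.0)
    o2 = (r + g - 2.0 * b) / math.sqrt(6.0)
    o3 = (r + g + b) / math.sqrt(3.0)
    return o1, o2, o3


# Value ranges of O1, O2, O3 for 8-bit RGB (assumed binning ranges).
OPPONENT_RANGES = [
    (-255.0 / math.sqrt(2.0), 255.0 / math.sqrt(2.0)),
    (-510.0 / math.sqrt(6.0), 510.0 / math.sqrt(6.0)),
    (0.0, 765.0 / math.sqrt(3.0)),
]


def normalized_histogram(values, lo, hi, bins):
    counts = [0] * bins
    for v in values:
        k = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(k, bins - 1))] += 1
    return [c / float(len(values)) for c in counts]


def hellinger(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def spatial_contrast(patch, bins=20):
    """patch: square list of rows of (r, g, b) tuples, side divisible by 3."""
    side = len(patch) // 3
    opponent = [[rgb_to_opponent(*px) for px in row] for row in patch]
    feature = []
    for channel, (lo, hi) in enumerate(OPPONENT_RANGES):
        hists = {}
        for ry in range(3):
            for rx in range(3):
                values = [opponent[y][x][channel]
                          for y in range(ry * side, (ry + 1) * side)
                          for x in range(rx * side, (rx + 1) * side)]
                hists[(ry, rx)] = normalized_histogram(values, lo, hi, bins)
        center = hists[(1, 1)]
        # Contrast of the central region with each of its 8 neighbours.
        feature.extend(hellinger(center, hists[key])
                       for key in sorted(hists) if key != (1, 1))
    return feature


red, blue = (255, 0, 0), (0, 0, 255)
# 6x6 patch: a red central region surrounded by blue regions.
patch = [[red if 2 <= y < 4 and 2 <= x < 4 else blue for x in range(6)]
         for y in range(6)]
feature = spatial_contrast(patch)
print(len(feature))
```

For this synthetic patch the two color channels give the maximal contrast for every neighbor, while the intensity channel gives zero contrast because red and blue have the same $O_3$ value. A shape-only descriptor computed on intensity would see no structure here.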

2) Channel-Contrastive Feature: The channel-contrastive
feature computes the feature contrast between different channels
of the same region. As before, the Hellinger distance is used to
compute the f-divergence, here between the R, G, and B channels.
As shown in Figure 2(b), we first extract a histogram from each
region in both channels under consideration. We then compute the
f-divergence between the two channels as

$$h(Q_x^i, Q_y^i) = \frac{1}{2} \sum_{k=1}^{d} \left(\sqrt{q_{x,k}^i} - \sqrt{q_{y,k}^i}\right)^2, \qquad (7)$$

where $Q_x^i$ and $Q_y^i$ are the $i$-th regions of the $x$ and
$y$ channels, $q_{x,k}^i$ and $q_{y,k}^i$ are the values of the
$k$-th bins of their histograms, and $k$ is the bin index of the
d-bin histogram. For an image patch with 3 x 3 regions, we obtain
a 9-dimensional channel-contrastive vector,

$$LCCD_C^{x,y} = [h(Q_x^1, Q_y^1), h(Q_x^2, Q_y^2), \ldots, h(Q_x^9, Q_y^9)]. \qquad (8)$$

The final channel-contrastive feature ($LCCD_C$) is constructed
by concatenating the three channel-contrastive vectors computed
between the R and G, R and B, and G and B channels, respectively.
The channel-contrastive feature is discussed further in
Section 3.D. The proposed LCCD descriptor is constructed from both
the spatially-contrastive and channel-contrastive features.
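
A corresponding sketch for the channel-contrastive feature is given below. As before, it is an illustration only: the function names, the binning range for 8-bit channel values, the histogram normalization, and the row-major region order are assumptions, and the subspace extension is left out.

```python
import math

# Channel index pairs: (R, G), (R, B), (G, B).
CHANNEL_PAIRS = [(0, 1), (0, 2), (1, 2)]


def normalized_histogram(values, bins, lo=0.0, hi=256.0):
    counts = [0] * bins
    for v in values:
        k = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(k, bins - 1))] += 1
    return [c / float(len(values)) for c in counts]


def hellinger(p, q):
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def channel_contrast(patch, bins=20, pairs=CHANNEL_PAIRS):
    """patch: square list of rows of (r, g, b) tuples, side divisible by 3.

    Returns 9 contrasts per channel pair, one for each of the 3x3 regions.
    """
    side = len(patch) // 3
    feature = []
    for x_ch, y_ch in pairs:
        for ry in range(3):
            for rx in range(3):
                cell = [patch[y][x]
                        for y in range(ry * side, (ry + 1) * side)
                        for x in range(rx * side, (rx + 1) * side)]
                hist_x = normalized_histogram([px[x_ch] for px in cell], bins)
                hist_y = normalized_histogram([px[y_ch] for px in cell], bins)
                # Contrast between two channels of the same region.
                feature.append(hellinger(hist_x, hist_y))
    return feature


red_patch = [[(255, 0, 0)] * 3 for _ in range(3)]
gray_patch = [[(128, 128, 128)] * 3 for _ in range(3)]
red_feature = channel_contrast(red_patch)
gray_feature = channel_contrast(gray_patch)
print(len(red_feature))
```

A gray patch yields zero contrast for every channel pair, whereas a saturated red patch yields maximal R-G and R-B contrast and zero G-B contrast. Passing `pairs=[(0, 1), (0, 2)]` restricts the computation to the R-G and R-B pairs.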

C. Subspace Extension 

To enhance the discriminative capability of the LCCD, we
introduce a subspace-based method for computing a more informative
feature from each image region. The subspace extension allows the
LCCD to capture more detailed features from the image, and hence
provides additional information for computing the color contrast,
which is the key to its discriminative power.

The subspace feature is computed from the original histogram
vector extracted from each region. Specifically, given a d-bin
histogram, the subspace vectors are generated by sliding a 1D
sub-window of size (length) $d_1$ densely over the original d-bin
histogram. In this way, the original histogram is decomposed into
multiple subspaces, or sub-histograms, each of which contains
$d_1$ bins. The number of generated subspaces is $d - d_1 + 1$.

We compute the contrast of each subspace between the two
generated sub-histograms independently. Eq. (6) is extended as

$$h_{sub}^{j}(P, Q^i) = \frac{1}{2} \sum_{k=j}^{j+d_1-1} \left(\sqrt{p_k} - \sqrt{q_k^i}\right)^2, \qquad (9)$$

where $j = 1, 2, \ldots, (d - d_1 + 1)$. The final contrastive
feature $h_{sub}(P, Q^i)$ is then constructed by concatenating all
the subspace contrasts $h_{sub}^{j}$ computed between the two
regions under consideration (e.g., regions $P$ and $Q^i$),

$$h_{sub}(P, Q^i) = [h_{sub}^{1}(P, Q^i), h_{sub}^{2}(P, Q^i), \ldots, h_{sub}^{d-d_1+1}(P, Q^i)]. \qquad (10)$$

Fig. 3. Performance of the spatially-contrastive feature (LCCD_S) and
channel-contrastive feature (LCCD_C) without the subspace extension and with
various dimensions of the subspace extension (on the MIT Indoor-67 database).

Therefore, the subspace-extended contrastive feature between
two regions is a $(d - d_1 + 1)$-dimensional vector, whereas the
original non-subspace one is a single value computed by Eq. (6).
The subspace extension operates at the region level, hence it can
be directly used to compute both the spatially- and
channel-contrastive features. In all our experiments, we
empirically set the size of the subspace window to $d_1 = 3$.
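
The sliding-window computation described above can be sketched in a few lines of Python. The function name `subspace_hellinger` is hypothetical, indices are 0-based rather than 1-based, and the sketch assumes that the sub-histograms are used as they are, without renormalizing each window.

```python
import math


def subspace_hellinger(p, q, window=3):
    """Return the (d - window + 1) windowed Hellinger contrasts of two d-bin histograms."""
    d = len(p)
    # Per-bin contribution to the Hellinger distance.
    terms = [0.5 * (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)]
    # Slide a window of `window` consecutive bins over the histogram.
    return [sum(terms[j:j + window]) for j in range(d - window + 1)]


p = [1.0, 0.0, 0.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0, 0.0, 0.0]
print(subspace_hellinger(p, q))            # three windows for d = 5, d_1 = 3
print(len(subspace_hellinger([0.05] * 20, [0.05] * 20)))
```

With $d = 20$ and $d_1 = 3$ the result has 18 entries per pair of regions. Unlike the single Hellinger value, the vector records where along the histogram the two regions differ.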

D. Analysis and Discussions 

To verify the effectiveness of the spatially-contrastive and
channel-contrastive features and of the subspace extension, we use
the MIT Indoor-67 database [31] (described in Section 4) to
compare the performance of LCCD_S and LCCD_C without and with the
subspace extension under various dimensions. Note that the
dimension of the LCCD with the subspace extension (for an image
patch) is determined by the number of original histogram bins
($d$) and the size (length) of the subspace window ($d_1$):
$d_{sub} = d - d_1 + 1$. In our experiments, we set
$d = 10, 20, 30$ with a fixed subspace window size of $d_1 = 3$,
which gives $d_{sub} = 8, 18, 28$. The results for both LCCD_S and
LCCD_C are presented in Figure 3. The LCCD with the subspace
extension consistently and considerably outperforms the
non-subspace version for both the spatially- and
channel-contrastive features. LCCD_C performs slightly better than
LCCD_S. The accuracies increase with the number of subspace
dimensions, and level off at 18-dimensional subspaces in both
cases. As a trade-off between performance and computational cost,
we use the 18-dimensional subspace in all the following
experiments.

We further investigate the performance of LCCD_S and LCCD_C
separately in Figure 4(a), and their combinations with SIFT in
Figure 4(b). In Figure 4(a), the LCCD_C computed from the RG and
RB channels achieves slightly higher performance than that from
the GB channels, and the combination of the three channel
contrasts yields a significant improvement. LCCD_S achieves
slightly better accuracy than LCCD_C on each single contrast, but
its performance is lower than that of LCCD_C with the combination
of three contrasts. As expected, the combination of LCCD_S and
LCCD_C achieves a further improvement over either alone.

In Figure 4(b), either LCCD_S or LCCD_C alone largely improves
the performance of the original SIFT when combined with it, and
the highest accuracy is obtained by combining SIFT with both. When
combined with SIFT, the LCCD_C using only the RG and RB channels
achieves slightly higher accuracy than the LCCD_C using all three
channel contrasts. We obtained similar results in further
experiments on the other databases in Section 4. We conjecture
that the contrastive information between the G and B channels may
be relatively weak or highly redundant. Therefore, the final LCCD
descriptor used in all the following experiments contains only
LCCD_S and the LCCD_C computed from the RG and RB channels.

Fig. 4. (a) Performance of LCCD_S and LCCD_C on the single RG, RB
and GB channel pairs, and of their combinations, all with the subspace extension. (b)
Combinations of LCCD_S and/or LCCD_C with SIFT.

We now discuss the fundamental differences between the
proposed LCCD and SIFT [14], which has been highly successful for
image description over the last decade. SIFT is extremely powerful
at detecting robust shape information in the image by computing
local gradient orientations. To compare the LCCD with SIFT, we
present the basic pipeline of SIFT in Figure 2(c). The SIFT
descriptor (for an image patch) is generated by concatenating 16
histograms, each of which is computed independently from the
gradient orientations of one region. The LCCD differs from SIFT in
two main respects. First, SIFT is computed in the gradient space,
using histograms of gradient orientations for feature
representation, and thus contains mainly the shape information of
the image. The LCCD, in contrast, exploits color information as an
important complementary feature that enriches the representation.
Second, the LCCD computes multiple contrastive features both
spatially and between multiple color channels, which makes it
capable of encoding more local contextual and underlying
structural information than SIFT, which does not consider local
spatial relationships (e.g., local contrast) between neighboring
regions at all. This may lead to a significant loss of information
in SIFT. We will show experimentally that both properties of the
LCCD make it strongly complementary to the SIFT descriptor for
image classification.

We further provide insight into the proposed contrastive
mechanism by connecting it with recent Deep Convolutional Neural
Networks (DCNN). Our observation is supported by the recent
success of the DCNN, which shows that both color contrast and edge
information are the main low-level features for image description,
as indicated in Figure 1. Intuitively, the design of our color
contrastive mechanism is closely related to the structure of the
low-level filters (from the first convolutional layer) learned by
the DCNN. As shown in Figure 1, there are mainly two types of
low-level filters, and one of them (mainly in the bottom part)
intuitively corresponds to our color contrast mechanism, as
follows. First, some filters are displayed mainly in a single
color. This means that their weights vary strongly between image
channels but only slightly within each channel. Hence, they are
able to capture the contrast between different color channels.
This mechanism is relatively close to the pipeline of our
channel-contrastive feature. Second, other filters are displayed
in two or more colors, indicating that their weights vary
significantly both spatially and between channels. They are
therefore able to detect contrastive information in both respects,
similar to the two contrastive features of our descriptor. These
intuitive low-level connections between the LCCD and the DCNN lend
support to the proposed color contrastive mechanism.


IV. Experimental Results and Discussions 

The performance of the LCCD was evaluated on three challenging
benchmark databases for image classification and scene
categorization: the MIT Indoor-67 database [31], SUN397 [32] and
PASCAL VOC 2007 [33]. We compare the performance of the LCCD and
its combination with SIFT against recent results on the three
databases.

In all our experiments, we resize the input image to
470 x 380. Each image is divided into 50 x 50 regions (cells). An
LCCD feature vector is extracted from an image patch of 3 x 3
regions. The LCCD features are computed densely by sliding a patch
window of 3 x 3 regions over the 50 x 50 regions, which yields
48 x 48 LCCD feature vectors per image. Each LCCD feature vector
is computed as follows. First, a 20-bin histogram is extracted
from each region for computing the color contrast. Second, the
subspace scheme is applied, and the 20-bin histogram is decomposed
into 18 subspaces, or sub-histograms, using a subspace window of
size (length) 3. Third, we compute a contrastive value from each
pair of subspaces using Eq. (6) or Eq. (7), and thus obtain an 18D
contrastive vector from each pair of regions under consideration.
Fourth, for a given image patch, we compute the contrastive
vectors from all possible region pairs, and generate the final
spatially- and channel-contrastive features with dimensions of
18 x 8 = 144 and 18 x 9 = 162, respectively. We then reduce the
dimensions of both contrastive features to 80 using PCA [48], and
finally obtain $LCCD_S, LCCD_C \in \mathbb{R}^{80 \times 48 \times 48}$
for an image.
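
The sizes quoted in this setup follow from two small formulas, which the sketch below simply evaluates. The helper names are hypothetical, and the sketch assumes that the patch window moves one region at a time; it only reproduces the arithmetic and does not implement the PCA step.

```python
def dense_positions(cells_per_side, patch_cells):
    """Number of patch positions per side when sliding one cell at a time."""
    return cells_per_side - patch_cells + 1


def subspace_count(bins, window):
    """Number of sub-histograms produced by a sliding window over the bins."""
    return bins - window + 1


positions = dense_positions(50, 3)          # patch positions per side
vectors_per_image = positions * positions   # LCCD feature vectors per image
d_sub = subspace_count(20, 3)               # contrast values per region pair
spatial_dim = d_sub * 8                     # 8 neighbour contrasts per patch
channel_dim = d_sub * 9                     # 9 region contrasts per channel pair

print(positions, vectors_per_image, d_sub, spatial_dim, channel_dim)
```

These values match the 48 x 48 feature grid, the 18 subspaces, and the 144- and 162-dimensional features stated above, before the reduction to 80 dimensions.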

We apply Fisher Vector (FV) encoding to LCCD_s
and LCCD_c separately. We train a codebook with 256 centers
using the Gaussian Mixture Model (GMM), and encode the
generated LCCD_s or LCCD_c vectors with the BoW model
[49], [50], [42]. The final LCCD descriptor combines both
LCCD_s and LCCD_c. For SIFT, we used the VLFeat
library to extract SIFT descriptors with 128 dimensions
for each patch. Similarly, they are reduced to 80-D by using
PCA [48], and are then also encoded with the BoW model
with 256 centers.

Fig. 5. Confusion matrices for SIFT (left) and LCCD+SIFT (right) on
ten categories.
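
As a rough illustration of the FV encoding step, the sketch below computes a Fisher Vector for a set of local descriptors under a diagonal-covariance GMM, using the commonly used gradients with respect to the component means and variances. This is an assumption about the general technique, not the exact variant or normalization used in the paper; the function name and the toy GMM parameters are hypothetical.

```python
import math


def fisher_vector(descriptors, weights, means, variances):
    """Fisher Vector of descriptors under a diagonal-covariance GMM.

    Returns 2*K*D values: gradients w.r.t. the K means, then the K variances.
    """
    n, k, d = len(descriptors), len(weights), len(means[0])
    g_mu = [[0.0] * d for _ in range(k)]
    g_var = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # log of w_j * N(x | mu_j, diag(var_j)) for every component j
        logp = []
        for j in range(k):
            s = math.log(weights[j])
            for t in range(d):
                s -= 0.5 * (math.log(2 * math.pi * variances[j][t])
                            + (x[t] - means[j][t]) ** 2 / variances[j][t])
            logp.append(s)
        m = max(logp)
        norm = sum(math.exp(v - m) for v in logp)
        post = [math.exp(v - m) / norm for v in logp]  # soft assignments
        for j in range(k):
            for t in range(d):
                u = (x[t] - means[j][t]) / math.sqrt(variances[j][t])
                g_mu[j][t] += post[j] * u
                g_var[j][t] += post[j] * (u * u - 1.0)
    fv = []
    for j in range(k):
        fv += [v / (n * math.sqrt(weights[j])) for v in g_mu[j]]
    for j in range(k):
        fv += [v / (n * math.sqrt(2 * weights[j])) for v in g_var[j]]
    return fv


# Toy GMM with K = 2 components and D = 2 dimensions (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
descriptors = [[0.1, -0.2], [3.9, 4.2], [0.0, 0.3]]
fv = fisher_vector(descriptors, weights, means, variances)
print(len(fv))  # 2 * K * D = 8
```

With the settings given in the text (256 centers, 80-D descriptors after PCA), an encoding of this form would have 2 x 256 x 80 = 40,960 dimensions per descriptor type.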

A. On the MIT Indoor-67 Database 

We evaluate the performance of the LCCD on the task of
indoor scene recognition. The experiments were conducted on
the large-scale MIT Indoor-67 database, which contains
67 classes and 15,620 images in total. The number of images
varies across categories, but each category includes at least
100 images. We use 80 training and 20 testing images
per category.

TABLE II

Comparisons of the LCCD with state-of-the-art descriptors
without Fisher Vector encoding on the MIT Indoor-67.

Method                          Publication   Accuracy (%)
Quattoni et al.                 CVPR2009      26.00
Li et al. [38]                  NIPS2010      37.60
Wang et al. [52]                CVPR2010      54.62
SIFT                            IJCV2004      51.85
Color SIFT [28]                 TPAMI2010     56.10
BoP+SIFT (Juneja et al. [39])   CVPR2013      56.66
LCCD                            -             20.36
LCCD+SIFT                       -             57.42

TABLE III

Comparisons of the LCCD with state-of-the-art descriptors
with Fisher Vector encoding on the MIT Indoor-67.

Method                          Publication   Accuracy (%)
Kobayashi et al. [53]           CVPR2013      58.91
Doersch et al. [54]             NIPS2013      64.03
SIFT (Jorge et al.) [42]        ECCV2010      62.16
Color SIFT [28]                 TPAMI2010     64.22
BoP+SIFT (Juneja et al. [39])   CVPR2013      63.10
OPM+SIFT (Xie et al. [40])      CVPR2014      63.48
LCCD                            -             36.43
LCCD+SIFT                       -             65.96

It has been verified that the FV encoding can improve the
performance considerably on this database. For a fair comparison,
we conducted two groups of experiments separately,
evaluating the methods without and with the FV encoding. The
results are presented in Tables II and III, respectively.

As can be seen, our LCCD descriptor combined with
SIFT achieves the best performance in both cases. With FV
encoding, LCCD+SIFT achieves a classification accuracy of
65.96%, which surpasses the OPM+SIFT result [40] (63.48%)
by about 2.5% (absolute). It also improves substantially on
SIFT alone in both cases, with gains of about
6% and 4% without and with FV, respectively. These improvements
are considerably larger than those of the recently proposed
Bag-of-Parts (BoP) [39] and Orientational Pyramid Matching
(OPM) [40] descriptors, which improve SIFT by 0.94% and
1.32%, respectively, in the FV case. This indicates that
our color contrastive feature provides stronger complementary
information for SIFT than the BoP and OPM methods. In addition,
we compared our descriptor against the color SIFT [28] and
obtained improvements of about 1.5% in both cases, suggesting
that our measure of color through the contrast mechanism is
more effective than the color feature used in the color SIFT
descriptor.
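
The margins quoted above follow directly from the accuracies in Tables II and III. The short check below recomputes them; the dictionary layout and the function name are illustrative only.

```python
# Accuracies (%) copied from Tables II and III.
without_fv = {"SIFT": 51.85, "Color SIFT": 56.10,
              "BoP+SIFT": 56.66, "LCCD+SIFT": 57.42}
with_fv = {"SIFT": 62.16, "Color SIFT": 64.22, "BoP+SIFT": 63.10,
           "OPM+SIFT": 63.48, "LCCD+SIFT": 65.96}


def gain(table, method, baseline="SIFT"):
    """Absolute accuracy difference (percentage points) over a baseline."""
    return round(table[method] - table[baseline], 2)


print(gain(without_fv, "LCCD+SIFT"))             # gain over SIFT, no FV
print(gain(with_fv, "LCCD+SIFT"))                # gain over SIFT, with FV
print(gain(with_fv, "BoP+SIFT"))                 # BoP gain over SIFT
print(gain(with_fv, "OPM+SIFT"))                 # OPM gain over SIFT
print(gain(with_fv, "LCCD+SIFT", "OPM+SIFT"))    # margin over OPM+SIFT
print(gain(with_fv, "LCCD+SIFT", "Color SIFT"))  # margin over color SIFT
```

All differences are absolute percentage points, not relative improvements.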

To examine the relation between the LCCD and
SIFT in more detail, we construct two confusion matrices, for SIFT and
for SIFT+LCCD, using ten categories: bakery, concerthall,
dentaloffice, diningroom, hairsalon, hospitalroom,
fastfoodrestaurant, office, livingroom and lockerroom.
The two matrices are shown in Figure 5. The diagonal
elements of the SIFT+LCCD matrix are clearly
larger than those of the SIFT-only matrix. For example,
the accuracies of concerthall and fastfoodrestaurant
increase substantially, from 65.16% to 82.04% and from 56.10% to
74.81%, respectively, indicating that the LCCD descriptor is
highly complementary to SIFT for image description.
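
For readers unfamiliar with the construction, a row-normalized confusion matrix of the kind shown in Figure 5 can be sketched as follows. The labels and predictions are toy values, not results from the paper, and the function name is hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    class-i samples that were predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix


# Toy example with three of the ten categories.
classes = ["bakery", "concerthall", "office"]
truth = ["bakery", "bakery", "concerthall", "concerthall", "office", "office"]
pred = ["bakery", "office", "concerthall", "concerthall", "office", "bakery"]
cm = confusion_matrix(truth, pred, classes)
for name, row in zip(classes, cm):
    print(name, row)
```

The diagonal entries are the per-class accuracies, so a larger diagonal corresponds to fewer confusions between categories.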

TABLE IV

Classification errors within paired categories by single LCCD
or SIFT, and by their combination.

Category A      Category B      SIFT      LCCD+SIFT
studiomusic     tvstudio        10.53%    5.27%
restaurant      bar             5%        0
poolinside      airportinside   10%       5%
clothingstore   bedroom         5.56%     0
corridor        staircase       9.52%     4.76%

Category A      Category B      LCCD      LCCD+SIFT
gym             closet          5.56%     0
hairsalon       greenhouse      4.76%     0
jewellery       mall            9.09%     4.55%
meetingroom     classroom       13.64%    4.55%
restaurant      buffet          5%        0

To further evaluate the effectiveness of the LCCD, we select
several pairs of categories that are difficult to classify
correctly with either SIFT or the LCCD alone. The error rates
of each of them and of their combination are listed in Table
IV. The error rates are clearly reduced
(by about 5%) by the combination, and some
pairs reach zero errors, further
indicating that the LCCD and SIFT complement each
other well. As a further demonstration, we present a number
of example images categorized by SIFT and LCCD+SIFT in
Figure 6. The improvements by our descriptor are again
visible. Most incorrect categorizations by our descriptor are
understandable, since most of these cases are hard to
separate correctly even for humans, such as livingroom and
waitingroom, or movietheater and concerthall.

Fig. 6. Image samples from categorization results by SIFT and LCCD+SIFT. The name on top of each image denotes the ground truth category. Images
from left to right are sorted by their precision scores in decreasing order from 10th to 15th. Images with the top 9 precision scores are almost all correctly classified.
Images with incorrect classification are marked with red bounding boxes.


B. On the SUN397 

The performance of the proposed LCCD descriptor was
evaluated on the SUN397 [32]. The database has 397 different
scene classes and is probably the largest database for
scene classification to date. It includes 108,754 images in
total. The number of images varies across classes, and each
class includes at least 100 images. Our experiments
follow previous work by using a subset of the
dataset with 50 training and 50 testing images per class,
and averaging the results over 10 partitions.
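
The evaluation protocol above (50 training and 50 testing images per class, averaged over 10 partitions) can be sketched as follows. The sketch draws the partitions at random, which is an assumption made for illustration; if predefined partition lists are used instead, they would replace `make_partition`. Both function names and the toy data are hypothetical.

```python
import random


def make_partition(images_by_class, n_train=50, n_test=50, seed=0):
    """Sample disjoint train/test subsets for every class (random stand-in
    for a benchmark partition)."""
    rng = random.Random(seed)
    train, test = [], []
    for label, images in images_by_class.items():
        chosen = rng.sample(images, n_train + n_test)
        train += [(img, label) for img in chosen[:n_train]]
        test += [(img, label) for img in chosen[n_train:]]
    return train, test


def mean_accuracy(evaluate, images_by_class, n_partitions=10):
    """Average the accuracy returned by evaluate(train, test) over partitions."""
    scores = [evaluate(*make_partition(images_by_class, seed=s))
              for s in range(n_partitions)]
    return sum(scores) / len(scores)


# Toy data: 3 classes with 100 image identifiers each.
toy = {f"class_{c}": [f"class_{c}/img_{i}" for i in range(100)]
       for c in range(3)}
train, test = make_partition(toy)
print(len(train), len(test))  # 150 150
```

Here `evaluate` stands for any routine that trains a classifier on the training subset and returns its accuracy on the testing subset.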

TABLE V

Comparisons of the LCCD with state-of-the-art descriptors
with Fisher Vector encoding on the SUN397.

Method                          Publication   Accuracy (%)
Xiao et al. [32]                CVPR2010      38.00
DeCAF                           ICML2014      40.94
SIFT (Jorge et al.) [42]        IJCV2013      43.02
LCS+SIFT (Jorge et al.) [42]    IJCV2013      47.20
OPM+SIFT (Xie et al. [40])      CVPR2014      45.91
LCCD                            -             20.29
LCCD+SIFT                       -             49.68

The performance of the LCCD descriptor was evaluated
with the FV encoding by comparing it with recent results on
the SUN397 database. As shown in Table V, the LCCD+SIFT
descriptor achieves the highest mean accuracy of 49.68%,
which improves the performance of SIFT alone by
more than 6.5%. The improvement is larger than that of
the most recent combination methods, OPM+SIFT (at
45.91%) [40] and LCS+SIFT (at 47.20%) [42]. Several image
categories with the largest improvements by our combined descriptor,
compared to SIFT alone, are presented in Figure 7.
Our descriptor boosts the performance of
SIFT substantially, with improvements of 30% in the swimming
pool outdoor category and 20% in the thriftshop category. On the right of
Figure 7, we list a number of categories that have very
similar global structures to those on the left, which makes them difficult
to discriminate correctly. Such confusable categories are
common in scene recognition; some of those in the MIT
Indoor-67 are also shown in Figure 6. Our improvements on
these categories suggest that our color descriptor is able
to capture more local detailed features, which are crucial for
identifying these ambiguous categories.

Furthermore, we notice that the proposed LCCD+SIFT
descriptor also obtains a large improvement (about 9%) over
the recent result of DeCAF, which is one of the most
advanced deep learning models. These results
support the effectiveness of the proposed LCCD. Deep learning
models have shown strong capability for image representation.
However, the high-level deep features computed via multi-layer
feature extraction are highly abstracted. They may lose
important local detailed information in the fully-connected layers,
leading to lower discrimination of the features.

C. On the PASCAL VOC 2007 Standards 

We further evaluate the performance of the LCCD descriptor
on the PASCAL VOC 2007 standards [33] for visual object
categorization. The PASCAL VOC 2007 standards [33] is
known as one of the most difficult image classification tasks
due to large variations in appearance and posture, and
to occlusions, which are often caused by complicated
real-world effects. In Table VI, we compare the classification
accuracies of related descriptors. All results are achieved with
FV encoding, except for color SIFT (cited from [28]). Again,
LCCD+SIFT achieves the highest accuracy of 65.80%,
improving the performance of individual SIFT (with the FV)
by 4%. The improvement is larger than that of
the LCS+SIFT descriptor [42]. The advantage of the color
contrastive information is again evident.

Fig. 7. Left: Image classes from the SUN397 where the LCCD+SIFT
achieves large improvements over SIFT; Right: Image classes which are easily
confused with the left ones. Panel labels: swimming pool outdoor (+30%),
swimming pool indoor, thriftshop (+20%), gift_shop, badminton court indoor,
bakery shop, volleyball court indoor, delicatessen.


recent combination descriptors (e.g. the LCS+SIFT [42] and
OPM+SIFT [40]) considerably, indicating that the proposed
LCCD provides stronger complementary properties to
SIFT than the others. The LCCD is capable of capturing
meaningful local detailed features, which are often discarded
by most gradient-based descriptors and DCNN models. Third,
our computation of color contrast in both spatial locations
and multiple channels achieves better performance than the
current color SIFT and LCS descriptors, suggesting that our
contrastive mechanisms provide a more principled approach
for measuring local color information.

V. Conclusions 

We have presented a simple yet powerful local descriptor,
the local color contrastive descriptor (LCCD), for image
classification. Beyond traditional shape-based descriptors, the neural
mechanisms of color contrast were introduced to enrich the
image representation with color information in the multimedia
and computer vision communities. We developed a novel
contrastive mechanism to compute the color contrast in both
spatial locations and multiple color channels, and successfully
applied it to detecting meaningful local structures of
images. We verified its effectiveness both theoretically and
experimentally, and demonstrated its strong ability to complement
the SIFT feature for image description. Extensive experimental
results show that the proposed LCCD descriptor combined with SIFT
substantially improves the performance of individual SIFT, and
achieves state-of-the-art performance on three benchmarks.